Développer un algorithme de classification pour aider les chargés de relation client à décider si un prêt peut être accordé à un client.
Fournir une mesure de l’importance des variables qui ont poussé le modèle à donner une probabilité à un client.
Ci dessous la table des matières :
Pour accorder un crédit, l'entreprise Home Credit doit estimer la solvabilité d'un client. Elle souhaite donc développer une classification automatique de scoring pour décider si un prêt peut être accordé ou non.
Il s'agit d'un problème de classification supervisée. Elle consiste à attribuer une classe à chaque objet à catégoriser, en se fondant sur des données statistiques. Toutes les données sont étiquetées et l'algorithme apprend à prédire le résultat des données d'entrée.
L'algorithme que nous mettrons en place doit permettre de :
Il s'agit en particulier d'identifier les demandeurs à risque en fonction des données disponibles, car le coût de ces demandeurs est très élevé pour l'organisme. En effet, des sommes importantes sont engagées pour chaque prêt.
Une personne physique est dite solvable lorsque la valeur de l’ensemble de ses biens est capable de couvrir les dettes qu’elle contracte.
Pour savoir si un client est capable de régler ses dettes, il faut se procurer des informations sur l’état de sa santé économique. Une enquête de solvabilité consiste à analyser différents indicateurs de rentabilité et de les comparer au chiffre d’affaires et au montant des capitaux propres de son client.
Ces éléments donnent un aperçu de l’état de ses liquidités, de son niveau d’endettement, en somme de sa capacité à couvrir ses dettes et leurs échéances.
Afin d'évaluer la solvabilité d'un client, certains éléments sont à prendre en compte :
Les éléments cités ci-dessus se trouvent dans le jeu de données que nous avons à notre disposition.
import joblib
from sklearn.metrics import *
from sklearn import preprocessing
from sklearn import model_selection
from sklearn import dummy
from sklearn import impute
from sklearn import metrics
from sklearn import linear_model
from sklearn import svm
from sklearn import ensemble
from sklearn import cluster
from sklearn import decomposition
from sklearn import feature_selection
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from imblearn.pipeline import Pipeline as Pipeline_imb
from imblearn import under_sampling
from imblearn import over_sampling
from sklearn.preprocessing import StandardScaler
%reload_ext autoreload
from my_functions import *
pd.set_option('display.max_column',122)
Commençons par afficher et éventuellement fusionner les données source.
Nous avons à notre disposition plusieurs fichiers sous format csv. Après analyse, il s'agit de 3 types d'informations :
Home CréditHome Crédit# Liste de fichiers à notre disposition
print(os.listdir(CSV_PATH))
['application_test.csv', 'bureau.csv', 'installments_payments.csv', 'application_train.csv', 'previous_application.csv', 'sample_submission.csv', 'POS_CASH_balance.csv', 'bureau_balance.csv', 'credit_card_balance.csv', 'HomeCredit_columns_description.csv']
Après examen du fichier descriptif des variables, nous avons décidé de sélectionner 3 fichiers source qui nous permettront de tirer le maximum d'informations utiles à l'entraînement du modèle :
Chargement de fichiers
application_train = pd.read_csv(CSV_PATH + 'application_train.csv')
bureau = pd.read_csv(CSV_PATH + 'bureau.csv')
credit_card_balance = pd.read_csv(CSV_PATH + 'credit_card_balance.csv')
print('Data shape: ', application_train.shape)
application_train.head()
Data shape: (307511, 122)
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.022 | 0.0198 | 0.0 | 0.0 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.00 | reg oper account | block of flats | 0.0149 | Stone, brick | No | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.079 | 0.0554 | 0.0 | 0.0 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.01 | reg oper account | block of flats | 0.0714 | Block | No | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Ce dataset compte :
Remarques :
Prêt à dépenser.TARGETprint(' Data shape : ', bureau.shape)
bureau.head()
Data shape : (1716428, 17)
| SK_ID_CURR | SK_ID_BUREAU | CREDIT_ACTIVE | CREDIT_CURRENCY | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | DAYS_CREDIT_UPDATE | AMT_ANNUITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 215354 | 5714462 | Closed | currency 1 | -497 | 0 | -153.0 | -153.0 | NaN | 0 | 91323.0 | 0.0 | NaN | 0.0 | Consumer credit | -131 | NaN |
| 1 | 215354 | 5714463 | Active | currency 1 | -208 | 0 | 1075.0 | NaN | NaN | 0 | 225000.0 | 171342.0 | NaN | 0.0 | Credit card | -20 | NaN |
| 2 | 215354 | 5714464 | Active | currency 1 | -203 | 0 | 528.0 | NaN | NaN | 0 | 464323.5 | NaN | NaN | 0.0 | Consumer credit | -16 | NaN |
| 3 | 215354 | 5714465 | Active | currency 1 | -203 | 0 | NaN | NaN | NaN | 0 | 90000.0 | NaN | NaN | 0.0 | Credit card | -16 | NaN |
| 4 | 215354 | 5714466 | Active | currency 1 | -629 | 0 | 1197.0 | NaN | 77674.5 | 0 | 2700000.0 | NaN | NaN | 0.0 | Consumer credit | -21 | NaN |
Ce dataset compte :
Remarques :
bureau nous fournit des données concernant tous les crédits antérieurs des clients auprès d'autres institutions financières qui ont été rapportés au Bureau de crédit (une organisation privée émettant des rapports de crédit).bureau. Un emprunt dans le fichier application peut avoir plusieurs crédits antérieurs. Pour chaque emprunt, il y a autant de lignes que de crédits enregistrés par le Bureau de crédit avant la date de demande. Cela dit, vérifions le nombre réél de clients faisant partie de cette base de données en calculant le nombre d'ID client :
len(bureau['SK_ID_CURR'].value_counts())
305811
print('Data shape : ', credit_card_balance.shape)
credit_card_balance.head()
Data shape : (3840312, 23)
| SK_ID_PREV | SK_ID_CURR | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | AMT_DRAWINGS_ATM_CURRENT | AMT_DRAWINGS_CURRENT | AMT_DRAWINGS_OTHER_CURRENT | AMT_DRAWINGS_POS_CURRENT | AMT_INST_MIN_REGULARITY | AMT_PAYMENT_CURRENT | AMT_PAYMENT_TOTAL_CURRENT | AMT_RECEIVABLE_PRINCIPAL | AMT_RECIVABLE | AMT_TOTAL_RECEIVABLE | CNT_DRAWINGS_ATM_CURRENT | CNT_DRAWINGS_CURRENT | CNT_DRAWINGS_OTHER_CURRENT | CNT_DRAWINGS_POS_CURRENT | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2562384 | 378907 | -6 | 56.970 | 135000 | 0.0 | 877.5 | 0.0 | 877.5 | 1700.325 | 1800.0 | 1800.0 | 0.000 | 0.000 | 0.000 | 0.0 | 1 | 0.0 | 1.0 | 35.0 | Active | 0 | 0 |
| 1 | 2582071 | 363914 | -1 | 63975.555 | 45000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 2250.000 | 2250.0 | 2250.0 | 60175.080 | 64875.555 | 64875.555 | 1.0 | 1 | 0.0 | 0.0 | 69.0 | Active | 0 | 0 |
| 2 | 1740877 | 371185 | -7 | 31815.225 | 450000 | 0.0 | 0.0 | 0.0 | 0.0 | 2250.000 | 2250.0 | 2250.0 | 26926.425 | 31460.085 | 31460.085 | 0.0 | 0 | 0.0 | 0.0 | 30.0 | Active | 0 | 0 |
| 3 | 1389973 | 337855 | -4 | 236572.110 | 225000 | 2250.0 | 2250.0 | 0.0 | 0.0 | 11795.760 | 11925.0 | 11925.0 | 224949.285 | 233048.970 | 233048.970 | 1.0 | 1 | 0.0 | 0.0 | 10.0 | Active | 0 | 0 |
| 4 | 1891521 | 126868 | -1 | 453919.455 | 450000 | 0.0 | 11547.0 | 0.0 | 11547.0 | 22924.890 | 27000.0 | 27000.0 | 443044.395 | 453919.455 | 453919.455 | 0.0 | 1 | 0.0 | 1.0 | 101.0 | Active | 0 | 0 |
Ce dataset compte :
Remarques :
Vérifions le nombre d'ID uniques :
len(credit_card_balance['SK_ID_CURR'].value_counts())
103558
Au total, les trois fichiers fournissent les informations au sein des 160 variables.
A présent, nous allons fusionner les trois fichiers. Nous allons garder seulement les variables les plus intéressantes.
Application train
app_train = application_train[['SK_ID_CURR','TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER','FLAG_OWN_CAR', 'FLAG_OWN_REALTY',
'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY','AMT_GOODS_PRICE',
'NAME_INCOME_TYPE','NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS','NAME_HOUSING_TYPE','DAYS_BIRTH',
'DAYS_EMPLOYED','OWN_CAR_AGE','FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE',
'FLAG_CONT_MOBILE','FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE',
'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY', 'REG_REGION_NOT_LIVE_REGION','REG_CITY_NOT_WORK_CITY',
'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE',
'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']]
Pour la suite de l'analyse, nous allons sélectionner une partie de la base de données, vu sa volumétrie importante :
app_train_sample = app_train.sample(frac=0.2, random_state=1)
Bureau
Nous supprimons les variables inintéressantes :
bureau.drop(bureau[['SK_ID_BUREAU', 'CREDIT_CURRENCY','DAYS_CREDIT_UPDATE']], axis=1, inplace=True)
Fusion de datasets
# Application train & Bureau
df_fusion = pd.merge(app_train_sample, bureau, on='SK_ID_CURR')
print("Le nouveau dataframe a ",df_fusion.shape[0]," lignes et ",
df_fusion.shape[1]," colonnes.")
Le nouveau dataframe a 291725 lignes et 48 colonnes.
# Application train & Bureau & Crédit Card Balance
df_fusion_bis = pd.merge(df_fusion, credit_card_balance[['SK_ID_CURR','MONTHS_BALANCE', 'AMT_BALANCE', 'AMT_CREDIT_LIMIT_ACTUAL', 'CNT_INSTALMENT_MATURE_CUM',
'NAME_CONTRACT_STATUS', 'SK_DPD', 'SK_DPD_DEF']], on='SK_ID_CURR', how='left')
df = df_fusion_bis.copy()
print('Data shape: ', df.shape)
df.head()
Data shape: (3912016, 55)
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY_x | AMT_GOODS_PRICE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | DAYS_BIRTH | DAYS_EMPLOYED | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | REG_REGION_NOT_LIVE_REGION | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | CREDIT_ACTIVE | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | AMT_CREDIT_MAX_OVERDUE | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | AMT_ANNUITY_y | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 174545 | 1 | Cash loans | F | N | Y | 0 | 135000.0 | 654498.0 | 31617.0 | 585000.0 | Pensioner | Secondary / secondary special | Married | House / apartment | -20614 | 365243 | NaN | 1 | 0 | 0 | 1 | 0 | 0 | NaN | 2.0 | 2 | 2 | 0 | 0 | 0 | XNA | NaN | 0.622652 | 0.602386 | Closed | -958 | 0 | -866.0 | -866.0 | NaN | 0 | 29704.5 | 0.0 | 0.0 | 0.0 | Consumer credit | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 174545 | 1 | Cash loans | F | N | Y | 0 | 135000.0 | 654498.0 | 31617.0 | 585000.0 | Pensioner | Secondary / secondary special | Married | House / apartment | -20614 | 365243 | NaN | 1 | 0 | 0 | 1 | 0 | 0 | NaN | 2.0 | 2 | 2 | 0 | 0 | 0 | XNA | NaN | 0.622652 | 0.602386 | Closed | -1181 | 0 | -85.0 | -1027.0 | NaN | 0 | 405000.0 | NaN | NaN | 0.0 | Consumer credit | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 174545 | 1 | Cash loans | F | N | Y | 0 | 135000.0 | 654498.0 | 31617.0 | 585000.0 | Pensioner | Secondary / secondary special | Married | House / apartment | -20614 | 365243 | NaN | 1 | 0 | 0 | 1 | 0 | 0 | NaN | 2.0 | 2 | 2 | 0 | 0 | 0 | XNA | NaN | 0.622652 | 0.602386 | Closed | -1316 | 0 | -950.0 | -1181.0 | 0.000 | 0 | 225000.0 | 0.0 | 0.0 | 0.0 | Consumer credit | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 174545 | 1 | Cash loans | F | N | Y | 0 | 135000.0 | 654498.0 | 31617.0 | 585000.0 | Pensioner | Secondary / secondary special | Married | House / apartment | -20614 | 365243 | NaN | 1 | 0 | 0 | 1 | 0 | 0 | NaN | 2.0 | 2 | 2 | 0 | 0 | 0 | XNA | NaN | 0.622652 | 0.602386 | Sold | -1026 | 0 | 69.0 | -189.0 | 8085.105 | 0 | 450000.0 | 0.0 | 0.0 | 0.0 | Consumer credit | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 174545 | 1 | Cash loans | F | N | Y | 0 | 135000.0 | 654498.0 | 31617.0 | 585000.0 | Pensioner | Secondary / secondary special | Married | House / apartment | -20614 | 365243 | NaN | 1 | 0 | 0 | 1 | 0 | 0 | NaN | 2.0 | 2 | 2 | 0 | 0 | 0 | XNA | NaN | 0.622652 | 0.602386 | Closed | -903 | 0 | NaN | -497.0 | 0.000 | 0 | 346500.0 | 0.0 | 0.0 | 0.0 | Credit card | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
df.dtypes.value_counts().plot.bar(title = 'Types de variables', cmap="Accent")
plt.grid()
plt.show()
Trois types de variables dans notre jeu de données : des float, des integers et des strings. Vérifions les modalités des features de type object :
for key, value in df.select_dtypes('object').iteritems():
print(f'{key :-<50} {len(df[key].unique())}')
NAME_CONTRACT_TYPE-------------------------------- 2 CODE_GENDER--------------------------------------- 3 FLAG_OWN_CAR-------------------------------------- 2 FLAG_OWN_REALTY----------------------------------- 2 NAME_INCOME_TYPE---------------------------------- 6 NAME_EDUCATION_TYPE------------------------------- 5 NAME_FAMILY_STATUS-------------------------------- 5 NAME_HOUSING_TYPE--------------------------------- 6 OCCUPATION_TYPE----------------------------------- 19 ORGANIZATION_TYPE--------------------------------- 58 CREDIT_ACTIVE------------------------------------- 4 CREDIT_TYPE--------------------------------------- 13 NAME_CONTRACT_STATUS------------------------------ 8
Les features de type object contiennent peu de modalités, sauf les suivantes :
for key, value in df.select_dtypes('int').iteritems():
print(f'{key :-<50} {len(df[key].unique())}')
SK_ID_CURR---------------------------------------- 52782 TARGET-------------------------------------------- 2 CNT_CHILDREN-------------------------------------- 8 DAYS_BIRTH---------------------------------------- 15744 DAYS_EMPLOYED------------------------------------- 8328 FLAG_MOBIL---------------------------------------- 1 FLAG_EMP_PHONE------------------------------------ 2 FLAG_WORK_PHONE----------------------------------- 2 FLAG_CONT_MOBILE---------------------------------- 2 FLAG_PHONE---------------------------------------- 2 FLAG_EMAIL---------------------------------------- 2 REGION_RATING_CLIENT------------------------------ 3 REGION_RATING_CLIENT_W_CITY----------------------- 3 REG_REGION_NOT_LIVE_REGION------------------------ 2 REG_CITY_NOT_WORK_CITY---------------------------- 2 LIVE_CITY_NOT_WORK_CITY--------------------------- 2 DAYS_CREDIT--------------------------------------- 2923 CREDIT_DAY_OVERDUE-------------------------------- 289 CNT_CREDIT_PROLONG-------------------------------- 6
for key, value in df.select_dtypes('float').iteritems():
print(f'{key :-<50} {len(df[key].unique())}')
AMT_INCOME_TOTAL---------------------------------- 760 AMT_CREDIT---------------------------------------- 3420 AMT_ANNUITY_x------------------------------------- 8889 AMT_GOODS_PRICE----------------------------------- 507 OWN_CAR_AGE--------------------------------------- 54 CNT_FAM_MEMBERS----------------------------------- 9 EXT_SOURCE_1-------------------------------------- 22804 EXT_SOURCE_2-------------------------------------- 40971 EXT_SOURCE_3-------------------------------------- 744 DAYS_CREDIT_ENDDATE------------------------------- 9858 DAYS_ENDDATE_FACT--------------------------------- 2872 AMT_CREDIT_MAX_OVERDUE---------------------------- 15439 AMT_CREDIT_SUM------------------------------------ 70654 AMT_CREDIT_SUM_DEBT------------------------------- 60738 AMT_CREDIT_SUM_LIMIT------------------------------ 9793 AMT_CREDIT_SUM_OVERDUE---------------------------- 428 AMT_ANNUITY_y------------------------------------- 10462 MONTHS_BALANCE------------------------------------ 97 AMT_BALANCE--------------------------------------- 231729 AMT_CREDIT_LIMIT_ACTUAL--------------------------- 131 CNT_INSTALMENT_MATURE_CUM------------------------- 122 SK_DPD-------------------------------------------- 559 SK_DPD_DEF---------------------------------------- 142
Vérifions à présent quel est le taux des NaN :
taux_nan=df.isna().sum()*100/df.shape[0]
fig=plt.figure(figsize=(20,10))
plt.xticks(rotation=45,
horizontalalignment='right',
fontweight='light',
fontsize='small')
plt.bar(x=taux_nan.sort_values().index,
height=taux_nan.sort_values().values)
plt.axhline(y=50, color='y', linestyle='-', linewidth='4', label='50%')
plt.axhline(y=70, color='r', linestyle='-', linewidth='4', label='70%')
plt.title("Taux de données manquantes", fontsize=16)
plt.legend()
plt.show()
Nous observons très peu de variables avec plus de 50% de valeurs manquantes :
DAYS_ENDDAY_FACTEXT_SOURCE_1OWN_CAR_AGEAMT_CREDIT_MAX_OVERDUEAMT_ANNUITY_ySeulement la dernière variable dépasse le seuil des 70%. Le jeu de données est donc plutôt bien rempli.
Pour la suite de l'analyse, nous allons supprimer colonnes comportant plus de 60% de valeurs manquantes :
df.drop(df[['AMT_ANNUITY_y', 'AMT_CREDIT_MAX_OVERDUE', 'OWN_CAR_AGE']], axis=1, inplace=True)
df.shape
(3912016, 52)
Pour des raisons de soucis de mémoire, nous allons continuer notre analyse sur un échantillon df_sampled du dataset initial df.
df_sampled = df.sample(frac=0.10, random_state=1)
df_sampled.shape
(391202, 52)
Vérifions s'il existe des doublons :
doublons = df_sampled[df_sampled.duplicated()]
doublons
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY_x | AMT_GOODS_PRICE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | DAYS_BIRTH | DAYS_EMPLOYED | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | REG_REGION_NOT_LIVE_REGION | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | CREDIT_ACTIVE | DAYS_CREDIT | CREDIT_DAY_OVERDUE | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT | CNT_CREDIT_PROLONG | AMT_CREDIT_SUM | AMT_CREDIT_SUM_DEBT | AMT_CREDIT_SUM_LIMIT | AMT_CREDIT_SUM_OVERDUE | CREDIT_TYPE | MONTHS_BALANCE | AMT_BALANCE | AMT_CREDIT_LIMIT_ACTUAL | CNT_INSTALMENT_MATURE_CUM | NAME_CONTRACT_STATUS | SK_DPD | SK_DPD_DEF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 86249 | 284883 | 0 | Cash loans | F | Y | Y | 0 | 162000.0 | 159264.0 | 10773.0 | 126000.0 | Working | Higher education | Married | House / apartment | -17975 | -1869 | 1 | 1 | 1 | 1 | 0 | 0 | Accountants | 2.0 | 2 | 2 | 0 | 0 | 0 | Self-employed | NaN | 0.296289 | 0.529890 | Active | -302 | 0 | 1509.0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | Credit card | -88.0 | 29229.930 | 180000.0 | 2.0 | Active | 1.0 | 1.0 |
| 2598660 | 134832 | 0 | Cash loans | M | N | Y | 0 | 293850.0 | 540000.0 | 21055.5 | 540000.0 | Commercial associate | Secondary / secondary special | Married | House / apartment | -18336 | -6561 | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 2.0 | 2 | 2 | 0 | 0 | 0 | School | 0.805055 | 0.680782 | 0.404878 | Closed | -1863 | 0 | -38.0 | -743.0 | 0 | 472500.0 | 0.0 | 0.0 | 0.0 | Consumer credit | -13.0 | 0.000 | 0.0 | 23.0 | Active | 0.0 | 0.0 |
| 3359379 | 252779 | 0 | Cash loans | F | N | Y | 2 | 202500.0 | 229500.0 | 12447.0 | 229500.0 | Working | Secondary / secondary special | Married | House / apartment | -10289 | -456 | 1 | 1 | 1 | 1 | 1 | 0 | Accountants | 4.0 | 1 | 1 | 0 | 0 | 0 | Self-employed | 0.700973 | 0.729825 | 0.517297 | Closed | -1710 | 0 | -614.0 | -1666.0 | 0 | 315000.0 | NaN | NaN | 0.0 | Consumer credit | -64.0 | 135452.430 | 135000.0 | 20.0 | Active | 0.0 | 0.0 |
| 1993733 | 401401 | 0 | Cash loans | M | Y | Y | 2 | 189000.0 | 521280.0 | 31630.5 | 450000.0 | Working | Secondary / secondary special | Married | House / apartment | -13651 | -182 | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 4.0 | 2 | 2 | 0 | 1 | 1 | Construction | NaN | 0.666170 | NaN | Active | -740 | 0 | 1066.0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | Credit card | -81.0 | 61005.465 | 67500.0 | 5.0 | Active | 1.0 | 1.0 |
| 1656756 | 190879 | 0 | Cash loans | F | N | Y | 0 | 76500.0 | 358443.0 | 13639.5 | 252000.0 | Commercial associate | Secondary / secondary special | Widow | House / apartment | -20771 | -1039 | 1 | 1 | 0 | 1 | 1 | 0 | Cleaning staff | 1.0 | 2 | 2 | 0 | 0 | 0 | Business Entity Type 3 | 0.632028 | 0.266520 | 0.591977 | Closed | -1740 | 0 | -1282.0 | -1281.0 | 0 | 48510.0 | 0.0 | NaN | 0.0 | Consumer credit | -15.0 | 0.000 | 450000.0 | 0.0 | Active | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3754689 | 450614 | 0 | Cash loans | F | Y | N | 1 | 238500.0 | 202500.0 | 20880.0 | 202500.0 | Working | Secondary / secondary special | Married | House / apartment | -15652 | -8064 | 1 | 1 | 0 | 1 | 0 | 0 | NaN | 3.0 | 2 | 2 | 0 | 0 | 0 | Military | 0.499903 | 0.222820 | 0.180888 | Closed | -2521 | 0 | -1973.0 | -1969.0 | 0 | 202500.0 | 0.0 | 0.0 | 0.0 | Consumer credit | -26.0 | 454606.830 | 450000.0 | 3.0 | Active | 0.0 | 0.0 |
| 2312040 | 106705 | 0 | Cash loans | F | N | N | 1 | 135000.0 | 509400.0 | 40243.5 | 450000.0 | Working | Secondary / secondary special | Married | With parents | -11773 | -311 | 1 | 1 | 1 | 1 | 0 | 0 | Sales staff | 3.0 | 3 | 2 | 0 | 0 | 0 | Business Entity Type 3 | 0.290899 | 0.137186 | 0.291097 | Active | -1177 | 0 | 633.0 | NaN | 0 | 0.0 | 0.0 | 0.0 | 0.0 | Credit card | -3.0 | 0.000 | 0.0 | 4.0 | Active | 0.0 | 0.0 |
| 3489341 | 344536 | 0 | Cash loans | F | N | Y | 4 | 301500.0 | 199152.0 | 10930.5 | 135000.0 | Working | Higher education | Married | House / apartment | -16573 | -6012 | 1 | 1 | 0 | 1 | 0 | 1 | Accountants | 6.0 | 2 | 2 | 0 | 0 | 0 | Agriculture | 0.757593 | 0.454474 | 0.576209 | Closed | -2420 | 0 | -594.0 | -1140.0 | 0 | 1800000.0 | 0.0 | 0.0 | 0.0 | Consumer credit | -36.0 | 332388.450 | 450000.0 | NaN | Active | 0.0 | 0.0 |
| 1536121 | 294723 | 0 | Cash loans | F | N | Y | 0 | 263700.0 | 495000.0 | 16096.5 | 495000.0 | Commercial associate | Secondary / secondary special | Separated | House / apartment | -21550 | -2142 | 1 | 1 | 0 | 1 | 0 | 0 | Security staff | 1.0 | 2 | 2 | 0 | 0 | 0 | Business Entity Type 2 | NaN | 0.584643 | 0.245851 | Closed | -2467 | 0 | -641.0 | -1810.0 | 0 | 112500.0 | 0.0 | 0.0 | 0.0 | Consumer credit | -4.0 | 27144.000 | 450000.0 | 11.0 | Active | 0.0 | 0.0 |
| 2438320 | 347134 | 0 | Cash loans | F | N | Y | 0 | 202500.0 | 797818.5 | 33930.0 | 634500.0 | Working | Secondary / secondary special | Civil marriage | House / apartment | -14725 | -137 | 1 | 1 | 0 | 1 | 0 | 0 | Medicine staff | 2.0 | 1 | 1 | 0 | 0 | 0 | Medicine | NaN | 0.719119 | 0.652897 | Closed | -1222 | 0 | -126.0 | -736.0 | 0 | 1110735.0 | NaN | NaN | 0.0 | Car loan | -54.0 | 0.000 | 0.0 | 7.0 | Active | 0.0 | 0.0 |
128 rows × 52 columns
128 doublons ! Nous allons supprimer les doublons en gardant l'observation la plus récente :
df_sampled.drop_duplicates(inplace=True)
df_sampled.shape
(391074, 52)
Nous remarquons la présence de la variable TARGET. Cette variable est binaire et contient les étiquettes suivantes :
Vérifions sa fréquence et distribution :
df_sampled["TARGET"].value_counts(normalize=True).plot.bar(title = 'Répartition 0/1')
df_sampled["TARGET"].value_counts(normalize=True)*100
plt.xlabel('Catégorie TARGET')
plt.ylabel('Nombre d\'emprunts')
plt.grid()
plt.show()
La variable TARGET contient deux modalités : 1 et 0.
0.1 sont des clients à risque. Nous constatons également que la variable TARGET n'est pas équilibrée. Nous allons revenir à ce sujet plus loin dans notre analyse.
Vérifions les chiffres exactes d'emprunts :
df_sampled['TARGET'].value_counts(normalize=True)
0 0.929535 1 0.070465 Name: TARGET, dtype: float64
Près de 93% de personnes ont réussi à rembourser leur prêt dans les délais, contre un peu plus de 7% qui ont fait défaut.
Dans un premier temps nous allons identifier les variables qualitatives. Ensuite, nous allons afficher quelques indicateurs statistiques.
var_cat=df_sampled.select_dtypes(include="object")
var_cat.columns.sort_values().to_list()
['CODE_GENDER', 'CREDIT_ACTIVE', 'CREDIT_TYPE', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_CONTRACT_STATUS', 'NAME_CONTRACT_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'NAME_INCOME_TYPE', 'OCCUPATION_TYPE', 'ORGANIZATION_TYPE']
show_stats_var_cat(var_cat)
| Nb de valeurs | Nb de modalités | Mode | Effectifs du mode | Freq du mode | % NaN | |
|---|---|---|---|---|---|---|
| NAME_CONTRACT_TYPE | 391074 | 2 | Cash loans | 381075 | 0.974432 | 0.000000 |
| CODE_GENDER | 391074 | 2 | F | 268551 | 0.686701 | 0.000000 |
| FLAG_OWN_CAR | 391074 | 2 | N | 246961 | 0.631494 | 0.000000 |
| FLAG_OWN_REALTY | 391074 | 2 | Y | 273286 | 0.698809 | 0.000000 |
| NAME_INCOME_TYPE | 391074 | 5 | Working | 200680 | 0.513151 | 0.000000 |
| NAME_EDUCATION_TYPE | 391074 | 5 | Secondary / secondary special | 279438 | 0.71454 | 0.000000 |
| NAME_FAMILY_STATUS | 391074 | 5 | Married | 270490 | 0.691659 | 0.000000 |
| NAME_HOUSING_TYPE | 391074 | 6 | House / apartment | 353855 | 0.904829 | 0.000000 |
| OCCUPATION_TYPE | 280141 | 18 | Laborers | 78839 | 0.281426 | 28.366243 |
| ORGANIZATION_TYPE | 391074 | 58 | Business Entity Type 3 | 84195 | 0.215292 | 0.000000 |
| CREDIT_ACTIVE | 391074 | 3 | Closed | 244903 | 0.626232 | 0.000000 |
| CREDIT_TYPE | 391074 | 11 | Consumer credit | 273688 | 0.699837 | 0.000000 |
| NAME_CONTRACT_STATUS | 370911 | 5 | Active | 356955 | 0.962374 | 5.155802 |
Conclusions
CODE_GENDER contient 3 genres, à vérifier.Cash loans (la fréquence du mode : 0.904787).OCCUPATION_TYPE) et 58 (ORGANIZATION_TYPE).Nous allons à présent passer en revue certaines variables.
Nous allons afficher toutes les variables catégorielles pour vérifier rapidement leur distribution :
distrib_var_cat(var_cat)
Variable 'CODE_GENDER'
analyze_uni_cat (df_sampled, 'CODE_GENDER')
F 68.670124 M 31.329876 Name: CODE_GENDER, dtype: float64
Conclusions
df_sampled.loc[df_sampled["CODE_GENDER"] == "XNA", "CODE_GENDER"] = df["CODE_GENDER"].mode()[0]
#Vérification
df_sampled.CODE_GENDER.unique()
array(['M', 'F'], dtype=object)
Variable 'NAME_FAMILY_STATUS'
analyze_uni_cat (df_sampled, 'NAME_FAMILY_STATUS')
Married 69.165938 Single / not married 10.944987 Civil marriage 8.431141 Separated 6.566788 Widow 4.891146 Name: NAME_FAMILY_STATUS, dtype: float64
Conclusions
Unknown doit disparaître - à remplacer par le mode :df_sampled.loc[df_sampled['NAME_FAMILY_STATUS'] =='Unknown', "NAME_FAMILY_STATUS"] = df_sampled["NAME_FAMILY_STATUS"].mode()[0]
#Vérification
df_sampled.NAME_FAMILY_STATUS.unique()
array(['Married', 'Single / not married', 'Civil marriage', 'Widow',
'Separated'], dtype=object)
Variable 'NAME_INCOME_TYPE'
analyze_uni_cat (df_sampled, 'NAME_INCOME_TYPE')
Working 51.315096 Commercial associate 25.429970 Pensioner 14.780579 State servant 8.468474 Student 0.005881 Name: NAME_INCOME_TYPE, dtype: float64
Conclusions
Variable 'NAME_EDUCATION_TYPE'
analyze_uni_cat (df_sampled, 'NAME_EDUCATION_TYPE')
Secondary / secondary special 71.453996 Higher education 24.958448 Incomplete higher 2.656019 Lower secondary 0.894460 Academic degree 0.037077 Name: NAME_EDUCATION_TYPE, dtype: float64
Conclusions
Dans un premier temps nous allons identifier les variables quantitatives de type INTEGER, ensuite les FLOAT.
df_int = df_sampled.select_dtypes('int')
df_int.shape[1]
19
show_stats_var_num(df_int)
| mean | std | min | max | 25% | 50% | 75% | % NaN | % null | skewness | |
|---|---|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 278379.39 | 102349.83 | 100003.00 | 456239.00 | 189614.00 | 278162.00 | 366581.00 | 0.0000 | 0.0000 | -0.0081 |
| TARGET | 0.07 | 0.26 | 0.0000 | 1.00 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 92.95 | 3.36 |
| CNT_CHILDREN | 0.41 | 0.73 | 0.0000 | 5.00 | 0.0000 | 0.0000 | 1.00 | 0.0000 | 70.85 | 1.75 |
| DAYS_BIRTH | -16699.99 | 3591.76 | -25200.00 | -7680.00 | -19677.00 | -16573.00 | -13874.00 | 0.0000 | 0.0000 | -0.01 |
| DAYS_EMPLOYED | 51352.13 | 130747.36 | -16651.00 | 365243.00 | -3819.00 | -1889.00 | -538.00 | 0.0000 | 0.0008 | 1.98 |
| FLAG_MOBIL | 1.00 | 0.0000 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.0000 | 0.0000 | 0.0000 |
| FLAG_EMP_PHONE | 0.85 | 0.35 | 0.0000 | 1.00 | 1.00 | 1.00 | 1.00 | 0.0000 | 14.78 | -1.98 |
| FLAG_WORK_PHONE | 0.19 | 0.39 | 0.0000 | 1.00 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 81.00 | 1.58 |
| FLAG_CONT_MOBILE | 1.00 | 0.02 | 0.0000 | 1.00 | 1.00 | 1.00 | 1.00 | 0.0000 | 0.06 | -41.38 |
| FLAG_PHONE | 0.31 | 0.46 | 0.0000 | 1.00 | 0.0000 | 0.0000 | 1.00 | 0.0000 | 68.78 | 0.81 |
| FLAG_EMAIL | 0.09 | 0.28 | 0.0000 | 1.00 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 91.31 | 2.93 |
| REGION_RATING_CLIENT | 2.03 | 0.49 | 1.00 | 3.00 | 2.00 | 2.00 | 2.00 | 0.0000 | 0.0000 | 0.06 |
| REGION_RATING_CLIENT_W_CITY | 2.01 | 0.49 | 1.00 | 3.00 | 2.00 | 2.00 | 2.00 | 0.0000 | 0.0000 | 0.02 |
| REG_REGION_NOT_LIVE_REGION | 0.0084 | 0.09 | 0.0000 | 1.00 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 99.16 | 10.76 |
| REG_CITY_NOT_WORK_CITY | 0.20 | 0.40 | 0.0000 | 1.00 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 80.37 | 1.53 |
| LIVE_CITY_NOT_WORK_CITY | 0.17 | 0.37 | 0.0000 | 1.00 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 83.24 | 1.78 |
| DAYS_CREDIT | -1214.93 | 811.59 | -2922.00 | -1.00 | -1785.00 | -1081.00 | -525.00 | 0.0000 | 0.0000 | -0.47 |
| CREDIT_DAY_OVERDUE | 0.70 | 35.15 | 0.0000 | 2656.00 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 99.79 | 60.93 |
| CNT_CREDIT_PROLONG | 0.0079 | 0.11 | 0.0000 | 5.00 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 99.36 | 17.82 |
Remarques
19 variables de type int figurent dans le dataset.
DAYS_BIRTH, DAYS_EMPLOYED et DAYS_CREDITAnalysons de plus près les variables qui semblent incohérentes.
Variables DAYS_BIRTH, DAYS_EMPLOYED et DAYS_CREDIT
Nous constatons que les chiffres des variables préfixées par DAYS_ sont négatifs.
La raison est assez simple : le calcul est fait en arrière à partir du jour de dépôt du dossier. De plus, les chiffres représentent les jours, or il serait plus intéressant de voir ces variables exprimées en années.
Le calcul à effectuer :
df_sampled['DAYS_BIRTH'] = df_sampled['DAYS_BIRTH'] / -365
(df_sampled['DAYS_EMPLOYED'] / 365).describe()
count 391074.000000 mean 140.690781 std 358.211947 min -45.619178 25% -10.463014 50% -5.175342 75% -1.473973 max 1000.665753 Name: DAYS_EMPLOYED, dtype: float64
Attention, nous obtenons une valeur max qui non seulement est positive, mais en plus indique 1000 années... Il s'agit clairement d'un outlier. Essayons de visualiser le résultat :
analyze_uni_num (df_sampled, 'DAYS_EMPLOYED')
Il s'agit d'un seul outlier. Il suffit donc de le supprimer.
df_sampled['DAYS_EMPLOYED'].replace({365243: np.nan}, inplace = True)
df_sampled['DAYS_EMPLOYED'] = df_sampled['DAYS_EMPLOYED'] / -1
Passons également la variable DAYS_CREDIT au positif :
df_sampled['DAYS_CREDIT'] = df_sampled['DAYS_CREDIT'] / -1
Les résultats sont corrects à présent. Vérifions les résultats :
df_sampled.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| SK_ID_CURR | 391074.0 | 278379.392279 | 102349.831332 | 100003.000000 | 189614.000000 | 278162.000000 | 366581.000000 | 4.562390e+05 |
| TARGET | 391074.0 | 0.070465 | 0.255929 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000e+00 |
| CNT_CHILDREN | 391074.0 | 0.414208 | 0.726146 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 5.000000e+00 |
| AMT_INCOME_TOTAL | 391074.0 | 190383.707023 | 102565.207781 | 27000.000000 | 126000.000000 | 171000.000000 | 225000.000000 | 4.500000e+06 |
| AMT_CREDIT | 391074.0 | 679949.043545 | 408625.294977 | 45000.000000 | 354276.000000 | 598500.000000 | 900000.000000 | 4.050000e+06 |
| AMT_ANNUITY_x | 391039.0 | 28820.579727 | 13743.362906 | 1980.000000 | 19188.000000 | 26550.000000 | 36315.000000 | 2.250000e+05 |
| AMT_GOODS_PRICE | 391043.0 | 609067.279643 | 377202.796000 | 40500.000000 | 315000.000000 | 517500.000000 | 855000.000000 | 4.050000e+06 |
| DAYS_BIRTH | 391074.0 | 45.753390 | 9.840437 | 21.041096 | 38.010959 | 45.405479 | 53.909589 | 6.904110e+01 |
| DAYS_EMPLOYED | 333271.0 | 3089.546636 | 2671.501938 | -0.000000 | 1089.000000 | 2431.000000 | 4256.000000 | 1.665100e+04 |
| FLAG_MOBIL | 391074.0 | 1.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000e+00 |
| FLAG_EMP_PHONE | 391074.0 | 0.852184 | 0.354918 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000e+00 |
| FLAG_WORK_PHONE | 391074.0 | 0.189967 | 0.392275 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000e+00 |
| FLAG_CONT_MOBILE | 391074.0 | 0.999417 | 0.024139 | 0.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000e+00 |
| FLAG_PHONE | 391074.0 | 0.312204 | 0.463393 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000e+00 |
| FLAG_EMAIL | 391074.0 | 0.086935 | 0.281740 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000e+00 |
| CNT_FAM_MEMBERS | 391074.0 | 2.190179 | 0.894157 | 1.000000 | 2.000000 | 2.000000 | 3.000000 | 7.000000e+00 |
| REGION_RATING_CLIENT | 391074.0 | 2.027084 | 0.494252 | 1.000000 | 2.000000 | 2.000000 | 2.000000 | 3.000000e+00 |
| REGION_RATING_CLIENT_W_CITY | 391074.0 | 2.006152 | 0.485356 | 1.000000 | 2.000000 | 2.000000 | 2.000000 | 3.000000e+00 |
| REG_REGION_NOT_LIVE_REGION | 391074.0 | 0.008418 | 0.091362 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000e+00 |
| REG_CITY_NOT_WORK_CITY | 391074.0 | 0.196272 | 0.397177 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000e+00 |
| LIVE_CITY_NOT_WORK_CITY | 391074.0 | 0.167608 | 0.373518 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000e+00 |
| EXT_SOURCE_1 | 188456.0 | 0.549606 | 0.199424 | 0.018895 | 0.401471 | 0.563856 | 0.713348 | 9.416521e-01 |
| EXT_SOURCE_2 | 390915.0 | 0.534749 | 0.180727 | 0.000006 | 0.436682 | 0.582900 | 0.672828 | 8.549997e-01 |
| EXT_SOURCE_3 | 381329.0 | 0.477602 | 0.190003 | 0.000527 | 0.340906 | 0.493863 | 0.627991 | 8.825303e-01 |
| DAYS_CREDIT | 391074.0 | 1214.927809 | 811.588457 | 1.000000 | 525.000000 | 1081.000000 | 1785.000000 | 2.922000e+03 |
| CREDIT_DAY_OVERDUE | 391074.0 | 0.701770 | 35.153523 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.656000e+03 |
| DAYS_CREDIT_ENDDATE | 366381.0 | 550.938698 | 5180.015931 | -42056.000000 | -1191.000000 | -349.000000 | 552.000000 | 3.119800e+04 |
| DAYS_ENDDATE_FACT | 245603.0 | -1061.159306 | 729.532422 | -42023.000000 | -1550.000000 | -945.000000 | -461.000000 | -1.000000e+00 |
| CNT_CREDIT_PROLONG | 391074.0 | 0.007927 | 0.108448 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000e+00 |
| AMT_CREDIT_SUM | 391074.0 | 363678.806488 | 866707.623500 | 0.000000 | 52650.000000 | 135000.000000 | 355500.000000 | 6.750000e+07 |
| AMT_CREDIT_SUM_DEBT | 329744.0 | 141242.332437 | 734818.289177 | -701383.230000 | 0.000000 | 0.000000 | 42102.000000 | 6.457024e+07 |
| AMT_CREDIT_SUM_LIMIT | 256884.0 | 8541.324067 | 49922.590738 | -399166.875000 | 0.000000 | 0.000000 | 0.000000 | 1.350000e+06 |
| AMT_CREDIT_SUM_OVERDUE | 391074.0 | 42.737197 | 5358.217523 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.025636e+06 |
| MONTHS_BALANCE | 370911.0 | -35.861856 | 26.579106 | -96.000000 | -57.000000 | -30.000000 | -12.000000 | -1.000000e+00 |
| AMT_BALANCE | 370911.0 | 57574.382205 | 106731.428850 | -135359.010000 | 0.000000 | 0.000000 | 88149.645000 | 9.909802e+05 |
| AMT_CREDIT_LIMIT_ACTUAL | 370911.0 | 146885.155738 | 158608.078816 | 0.000000 | 45000.000000 | 112500.000000 | 180000.000000 | 1.350000e+06 |
| CNT_INSTALMENT_MATURE_CUM | 339134.0 | 21.134997 | 19.491909 | 0.000000 | 5.000000 | 16.000000 | 33.000000 | 1.180000e+02 |
| SK_DPD | 370911.0 | 9.235795 | 95.024056 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.466000e+03 |
| SK_DPD_DEF | 370911.0 | 0.107309 | 11.019107 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.466000e+03 |
Distribution
distrib_var_num (df_int)
Un coup d'oeil rapide sur la feature CNT_CHILDREN pour se rendre compte de sa distribution :
Variable CNT_CHILDREN
analyze_uni_cat (df, 'CNT_CHILDREN')
0 70.827011 1 18.666463 2 8.812515 3 1.545801 4 0.137321 5 0.010557 7 0.000230 6 0.000102 Name: CNT_CHILDREN, dtype: float64
Conclusions
df_float = df_sampled.select_dtypes('float')
df_float.shape
(391074, 23)
df_float_sampled = df_float.sample(frac=0.5, random_state=1)
df_float_sampled.shape
(195537, 23)
show_stats_var_num(df_float_sampled)
| mean | std | min | max | 25% | 50% | 75% | % NaN | % null | skewness | |
|---|---|---|---|---|---|---|---|---|---|---|
| AMT_INCOME_TOTAL | 190541.60 | 102557.22 | 27000.00 | 1800000.00 | 130500.00 | 171000.00 | 225000.00 | 0.0000 | 0.0000 | 3.18 |
| AMT_CREDIT | 680343.39 | 408604.14 | 45000.00 | 4050000.00 | 355500.00 | 599116.50 | 900000.00 | 0.0000 | 0.0000 | 0.92 |
| AMT_ANNUITY_x | 28842.10 | 13735.26 | 1980.00 | 180000.00 | 19210.50 | 26608.50 | 36328.50 | 0.0077 | 0.0000 | 1.10 |
| AMT_GOODS_PRICE | 609178.06 | 376837.60 | 45000.00 | 4050000.00 | 315000.00 | 517500.00 | 855000.00 | 0.0077 | 0.0000 | 1.01 |
| DAYS_BIRTH | 45.76 | 9.85 | 21.08 | 69.04 | 37.96 | 45.41 | 53.93 | 0.0000 | 0.0000 | 0.01 |
| DAYS_EMPLOYED | 3084.21 | 2670.35 | 4.00 | 16651.00 | 1085.00 | 2423.00 | 4249.00 | 14.85 | 0.0000 | 1.53 |
| CNT_FAM_MEMBERS | 2.19 | 0.89 | 1.00 | 7.00 | 2.00 | 2.00 | 3.00 | 0.0000 | 0.0000 | 0.91 |
| EXT_SOURCE_1 | 0.55 | 0.20 | 0.02 | 0.94 | 0.40 | 0.56 | 0.71 | 51.87 | 0.0000 | -0.31 |
| EXT_SOURCE_2 | 0.53 | 0.18 | 0.0000 | 0.82 | 0.44 | 0.58 | 0.67 | 0.04 | 0.0000 | -0.93 |
| EXT_SOURCE_3 | 0.48 | 0.19 | 0.0005 | 0.88 | 0.34 | 0.49 | 0.63 | 2.47 | 0.0000 | -0.26 |
| DAYS_CREDIT | 1215.73 | 812.04 | 1.00 | 2922.00 | 526.00 | 1081.00 | 1788.00 | 0.0000 | 0.0000 | 0.47 |
| DAYS_CREDIT_ENDDATE | 555.26 | 5194.58 | -41938.00 | 31198.00 | -1193.00 | -350.00 | 557.00 | 6.27 | 0.04 | 4.83 |
| DAYS_ENDDATE_FACT | -1062.41 | 738.28 | -42023.00 | -1.00 | -1555.00 | -946.00 | -462.00 | 37.15 | 0.0000 | -3.27 |
| AMT_CREDIT_SUM | 363680.71 | 859725.95 | 0.0000 | 67500000.00 | 52771.50 | 135000.00 | 358200.00 | 0.0000 | 4.37 | 15.36 |
| AMT_CREDIT_SUM_DEBT | 139830.14 | 686543.62 | -701383.23 | 64570243.50 | 0.0000 | 0.0000 | 41575.50 | 15.65 | 58.70 | 31.31 |
| AMT_CREDIT_SUM_LIMIT | 8532.63 | 50524.27 | -399166.88 | 1350000.00 | 0.0000 | 0.0000 | 0.0000 | 34.20 | 60.44 | 9.06 |
| AMT_CREDIT_SUM_OVERDUE | 34.39 | 4184.32 | 0.0000 | 1025635.50 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 99.79 | 229.06 |
| MONTHS_BALANCE | -35.83 | 26.56 | -96.00 | -1.00 | -57.00 | -30.00 | -12.00 | 5.12 | 0.0000 | -0.53 |
| AMT_BALANCE | 57771.91 | 107148.65 | -135359.01 | 962510.67 | 0.0000 | 0.0000 | 88330.94 | 5.12 | 54.47 | 2.95 |
| AMT_CREDIT_LIMIT_ACTUAL | 147039.97 | 159006.48 | 0.0000 | 1350000.00 | 45000.00 | 112500.00 | 180000.00 | 5.12 | 20.09 | 2.09 |
| CNT_INSTALMENT_MATURE_CUM | 21.15 | 19.54 | 0.0000 | 118.00 | 5.00 | 16.00 | 33.00 | 13.32 | 11.11 | 1.05 |
| SK_DPD | 9.24 | 95.29 | 0.0000 | 2346.00 | 0.0000 | 0.0000 | 0.0000 | 5.12 | 91.09 | 12.30 |
| SK_DPD_DEF | 0.10 | 9.94 | 0.0000 | 2346.00 | 0.0000 | 0.0000 | 0.0000 | 5.12 | 92.72 | 190.58 |
Remarques
23 variables de type float figurent dans le dataset.
EXT_SOURCE_1 contient plus de 50% de NAN.AMT_INCOME_TOTALAMT_CREDITAMT_ANNUITY_xA vérifier, il s'agit très probablement de valeurs aberrantes.
Analysons de plus près certaines variables.
Variable AMT_INCOME_TOTAL
analyze_uni_num (df_sampled, 'AMT_INCOME_TOTAL')
La variable contient des valeurs aberrantes. Est-ce lié aux disparités dans la société ? Nous allons vérifier cela en affichant les revenus en fonction de l'éducation reçue et du secteur d'activité :
compare_income ("ORGANIZATION_TYPE", df_sampled)
Nous pouvons en conclure que les revenus hors norme sont attribués aux personnes issues du domaine du Business. Un point extrême venant du Transport type 4.
compare_income ("NAME_EDUCATION_TYPE", df_sampled)
Il semblerait que les revenus hors norme sont attribués aux personnes diplômées.
Nous allons remplacer les outliers par la moyenne (pour les valeurs au dessus de 3 écarts type) :
facteur = 3
upper_lim = df_sampled['AMT_INCOME_TOTAL'].mean() + df_sampled['AMT_INCOME_TOTAL'].std()*facteur
df_sampled = df_sampled[(df_sampled['AMT_INCOME_TOTAL'] < upper_lim)]
Variable AMT_CREDIT_SUM
analyze_uni_num (df_sampled, 'AMT_CREDIT_SUM')
Nous constatons de nombreux outliers au sein de cette feature. Idem, essayons de les remplacer par la moyenne (également pour les valeurs au dessus de 3 écarts type) :
facteur = 3
upper_lim = df_sampled['AMT_CREDIT_SUM'].mean() + df_sampled['AMT_CREDIT_SUM'].std()*facteur
df_sampled = df_sampled[(df_sampled['AMT_CREDIT_SUM'] < upper_lim)]
Distribution
distrib_var_num (df_float)
Plus loin, nous allons procéder à une normalisation, étant donné les plages de valeurs très différentes parmi les variables.
L'analyse univariée nous a permis d'avoir un aperçu général du profil des emprunteurs. A présent, nous allons explorer les features en rapport avec la variable TARGET, afin d'émettre des hypothèses plus poussées.
Nous allons passer en revue toutes les variables quantitatives et afficher leur rapport avec la variable TARGET, pour ensuite regarder les résultats de plus près.
analyze_bi_cat_all(var_cat,df_sampled,'TARGET')
Gender
Gender = pd.crosstab(df_sampled["CODE_GENDER"],df_sampled["TARGET"])
print(Gender)
Gender.div(Gender.sum(1).astype(float),axis=0).plot(kind="bar",stacked=True,figsize=(6,6))
plt.xlabel("Gender")
plt.ylabel("Percentage")
plt.show()
TARGET 0 1 CODE_GENDER F 246566 16846 M 107173 10128
Situation familiale
Married=pd.crosstab(df_sampled["NAME_FAMILY_STATUS"],df_sampled["TARGET"])
print(Married)
Married.div(Married.sum(1).astype(float),axis=0).plot(kind="bar",stacked=True,figsize=(6,6))
plt.xlabel("Married")
plt.ylabel("Percentage")
plt.show()
TARGET 0 1 NAME_FAMILY_STATUS Civil marriage 29885 2435 Married 244178 18953 Separated 23092 1892 Single / not married 38518 3052 Widow 18066 642
Progéniture
Children=pd.crosstab(df_sampled["CNT_FAM_MEMBERS"],df_sampled["TARGET"])
print(Children)
Children.div(Children.sum(1).astype(float),axis=0).plot(kind="bar",stacked=True,figsize=(6,6))
plt.xlabel("Children")
plt.ylabel("Percentage")
plt.show()
TARGET 0 1 CNT_FAM_MEMBERS 1.0 66878 4921 2.0 196100 13577 3.0 57352 5126 4.0 27856 2866 5.0 5209 316 6.0 340 168 7.0 4 0
Education
Education=pd.crosstab(df_sampled["NAME_EDUCATION_TYPE"],df_sampled["TARGET"])
print(Education)
Education.div(Education.sum(1).astype(float),axis=0).plot(kind="bar",stacked=True,figsize=(6,6))
plt.xlabel("Education")
plt.ylabel("Percentage")
plt.show()
TARGET 0 1 NAME_EDUCATION_TYPE Academic degree 64 1 Higher education 88223 4350 Incomplete higher 9423 634 Lower secondary 3100 307 Secondary / secondary special 252929 21682
Situation professionnelle
ft='OCCUPATION_TYPE'
analyze_bi_cat_one (df_sampled, ft, 'TARGET')
TARGET 0 1 OCCUPATION_TYPE Accountants 14898 450 Cleaning staff 5571 361 Cooking staff 6142 663 Core staff 33963 2274 Drivers 19577 1926 HR staff 906 24 High skill tech staff 15995 727 IT staff 517 3 Laborers 70051 7177 Low-skill Laborers 744 294 Managers 26259 1689 Medicine staff 13582 867 Private service staff 2964 216 Realty agents 633 102 Sales staff 29312 2663 Secretaries 1933 211 Security staff 7279 626 Waiters/barmen staff 1486 65
Secteur d'activité
ft='ORGANIZATION_TYPE'
analyze_bi_cat_one (df_sampled, ft, 'TARGET')
TARGET 0 1 ORGANIZATION_TYPE Advertising 541 5 Agriculture 2026 98 Bank 2103 56 Business Entity Type 1 6521 1339 Business Entity Type 2 17695 1640 Business Entity Type 3 74210 7242 Cleaning 351 7 Construction 5845 813 Culture 664 0 Electricity 1718 82 Emergency 392 4 Government 15231 1194 Hotel 1154 24 Housing 4407 328 Industry: type 1 1456 199 Industry: type 10 375 46 Industry: type 11 3995 387 Industry: type 12 494 1 Industry: type 13 32 0 Industry: type 2 586 7 Industry: type 3 3625 707 Industry: type 4 1122 28 Industry: type 5 1102 29 Industry: type 6 33 4 Industry: type 7 1696 92 Industry: type 8 38 0 Industry: type 9 5470 192 Insurance 860 7 Kindergarten 8103 411 Legal Services 552 13 Medicine 17512 1449 Military 3717 219 Mobile 475 6 Other 22183 796 Police 2561 25 Postal 2742 116 Realtor 262 60 Religion 41 28 Restaurant 1741 83 School 12358 761 Security 4330 350 Security Ministries 2325 53 Self-employed 36976 2821 Services 1634 306 Telecom 465 71 Trade: type 1 464 67 Trade: type 2 1203 153 Trade: type 3 4066 179 Trade: type 4 254 0 Trade: type 5 70 0 Trade: type 6 503 0 Trade: type 7 6769 443 Transport: type 1 157 1 Transport: type 2 4562 540 Transport: type 3 987 151 Transport: type 4 6706 394 University 1859 82 XNA 54420 2865
Conclusions
Gender :
Situation familiale :
Progéniture :
Education :
Situation professionnelle :
Secteur d'activité :
Remarque
Nous remarquons la présence de modalités indéfinies dans la variable ORGANIZATION_TYPE => XNA
df_sampled.ORGANIZATION_TYPE.unique()
array(['Self-employed', 'Industry: type 11', 'Business Entity Type 3',
'Industry: type 7', 'Trade: type 7', 'Medicine', 'XNA',
'Construction', 'Postal', 'Kindergarten', 'Government', 'Services',
'Industry: type 12', 'Other', 'School', 'Transport: type 4',
'Business Entity Type 2', 'Industry: type 1', 'Housing',
'Business Entity Type 1', 'Industry: type 9', 'Transport: type 3',
'Restaurant', 'Industry: type 3', 'Agriculture', 'University',
'Transport: type 2', 'Security', 'Culture', 'Advertising',
'Electricity', 'Trade: type 2', 'Bank', 'Insurance', 'Mobile',
'Military', 'Hotel', 'Industry: type 10', 'Industry: type 6',
'Trade: type 3', 'Industry: type 4', 'Police', 'Telecom',
'Security Ministries', 'Industry: type 5', 'Industry: type 8',
'Industry: type 2', 'Realtor', 'Emergency', 'Trade: type 6',
'Legal Services', 'Trade: type 4', 'Cleaning', 'Trade: type 5',
'Trade: type 1', 'Transport: type 1', 'Industry: type 13',
'Religion'], dtype=object)
df_sampled.loc[df_sampled["ORGANIZATION_TYPE"] == "XNA", "ORGANIZATION_TYPE"] = df_sampled["ORGANIZATION_TYPE"].mode()[0]
Dans ce chapitre nous allons étudier les corrélations entre les variables quantitatives de type AMT_ et la TARGET.
Variables de type AMT_
Les revenus : variable AMT_INCOME_TOTAL
bins=[25650,112500,147150,202500,1170000]
group=['Low','Average','High', 'Very high']
df_sampled['Income_bin']=pd.cut(df_sampled['AMT_INCOME_TOTAL'],bins,labels=group)
Income_bin = pd.crosstab(df_sampled["Income_bin"],df_sampled["TARGET"])
print(Income_bin)
Income_bin.div(Income_bin.sum(1).astype(float),axis=0).plot(kind="bar",stacked=True,figsize=(6,6))
plt.xlabel("Revenus de demandeur")
plt.ylabel("Percentage")
plt.show()
TARGET 0 1 Income_bin Low 77496 6144 Average 59524 4360 High 107509 8034 Very high 109210 8436
Conclusion
Le montant du crédit : variable AMT_CREDIT
bins=[45000,270000,513531,808650,4050000]
group=['Low','Average','High', 'Very high']
df_sampled['Credit_bin']=pd.cut(df_sampled['AMT_CREDIT'],bins,labels=group)
Credit_bin = pd.crosstab(df_sampled["Credit_bin"],df_sampled["TARGET"])
print(Credit_bin)
Credit_bin.div(Credit_bin.sum(1).astype(float),axis=0).plot(kind="bar",stacked=True,figsize=(6,6))
plt.xlabel("Montant du crédit")
plt.ylabel("Percentage")
plt.show()
TARGET 0 1 Credit_bin Low 61387 4150 Average 80824 6575 High 100093 9307 Very high 111252 6942
Conclusion
Les crédits dont le montant était élevé et moyennement élevé (Average et High) auraientt été les plus impactés par les difficultés de paiement.
Il est intéressant de constater que les crédits les plus chers étaient les moins problématiques à rembourser.
Le prix du bien : variable AMT_GOODS_PRICE
bins=[40500,238500,450000,679500,4050000]
group=['Low','Average','High', 'Very high']
df_sampled['Price_bin']=pd.cut(df_sampled['AMT_GOODS_PRICE'],bins,labels=group)
Price_bin = pd.crosstab(df_sampled["Price_bin"],df_sampled["TARGET"])
print(Price_bin)
Price_bin.div(Price_bin.sum(1).astype(float),axis=0).plot(kind="bar",stacked=True,figsize=(6,6))
plt.xlabel("Prix du bien")
plt.ylabel("Percentage")
plt.show()
TARGET 0 1 Price_bin Low 68088 4880 Average 74092 6921 High 98758 8710 Very high 112768 6462
Conclusion
Les biens dont le prix était estimé comme moyen (Average) aurait été les plus impactés par les difficultés de paiement.
Il est intéressant de constater que les biens très chers étaient les moins problématiques à rembourser.
Les annuités : variable AMT_ANNUITY
bins=[1615,16524,24903,34596,2580255]
group=['Low','Average','High', 'Very high']
df_sampled['Annuity_bin']=pd.cut(df_sampled['AMT_ANNUITY_x'],bins,labels=group)
Annuity_bin = pd.crosstab(df_sampled["Annuity_bin"],df_sampled["TARGET"])
print(Annuity_bin)
Annuity_bin.div(Annuity_bin.sum(1).astype(float),axis=0).plot(kind="bar",stacked=True,figsize=(6,6))
plt.xlabel("Annuités")
plt.ylabel("Percentage")
plt.show()
TARGET 0 1 Annuity_bin Low 66902 3823 Average 87708 7463 High 102161 8727 Very high 96933 6961
Conclusion
Les personnes dont les annuités étaient élevées (High) auraient le plus de mal à payer le crédit.
Il est intéressant de constater que les annuités peu élevées mais également celles très élevées étaient les moins problématiques à rembourser.
df_sampled=df_sampled.drop(["Income_bin","Credit_bin","Price_bin","Annuity_bin"],axis=1)
Une façon d'aborder les corrélations entre les variables est la corrélation de Pearson. La méthode .corr() peut nous fournir une idée d'intensité dans les relations. Nous allons adopter l'interprétation suivante :
# Find correlations with the target and sort
correlations = df_sampled.corr()['TARGET'].sort_values()
# Display correlations
print('===> Most Positive Correlations:\n', round(correlations.tail(15),3))
print('\n\n===> Most Negative Correlations:\n', round(correlations.head(15),3))
===> Most Positive Correlations: AMT_CREDIT_SUM 0.007 AMT_CREDIT_LIMIT_ACTUAL 0.016 DAYS_CREDIT_ENDDATE 0.018 FLAG_WORK_PHONE 0.021 DAYS_ENDDATE_FACT 0.026 CNT_FAM_MEMBERS 0.029 CNT_CHILDREN 0.030 FLAG_EMP_PHONE 0.034 AMT_CREDIT_SUM_DEBT 0.036 MONTHS_BALANCE 0.042 AMT_BALANCE 0.043 REGION_RATING_CLIENT 0.064 REGION_RATING_CLIENT_W_CITY 0.064 TARGET 1.000 FLAG_MOBIL NaN Name: TARGET, dtype: float64 ===> Most Negative Correlations: EXT_SOURCE_3 -0.169 EXT_SOURCE_2 -0.155 EXT_SOURCE_1 -0.142 DAYS_BIRTH -0.072 DAYS_EMPLOYED -0.059 DAYS_CREDIT -0.051 CNT_INSTALMENT_MATURE_CUM -0.038 AMT_GOODS_PRICE -0.037 AMT_CREDIT -0.025 REG_REGION_NOT_LIVE_REGION -0.015 FLAG_CONT_MOBILE -0.012 AMT_ANNUITY_x -0.008 FLAG_PHONE -0.008 AMT_CREDIT_SUM_LIMIT -0.008 AMT_INCOME_TOTAL -0.008 Name: TARGET, dtype: float64
Nous allons utiliser une heatmap pour visualiser les corrélations :
TARGET versus les features corrélées positivementTARGET versus les features corrélées négativementTARGET versus les features corrélées positivement
correlations_head=df_sampled[["TARGET",
'AMT_CREDIT_SUM_DEBT',
'REG_CITY_NOT_WORK_CITY',
'AMT_CREDIT_LIMIT_ACTUAL',
'DAYS_CREDIT_ENDDATE',
'FLAG_WORK_PHONE',
'DAYS_ENDDATE_FACT',
'CNT_CHILDREN',
'CNT_FAM_MEMBERS',
'FLAG_EMP_PHONE',
'MONTHS_BALANCE',
'AMT_BALANCE',
'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY']]
matrix = correlations_head.corr()
#f, ax = plt.subplots(figsize=(10, 12))
sns.clustermap(matrix,
square=True, # make cells square
cbar_kws={'fraction' : 0.03}, # shrink colour bar
cmap='cividis', # use orange/red colour map
linewidth=1, # space between cells
cbar=True)
plt.show()
Nous remarquons une forte corrélation entre les features :
REGION_RATING_CLIENT et REGION_RATING_CLIENT_W_CITYCNT_CHILDREN et CNT_FAM_MEMBERSet une moins forte entre les features suivantes :
AMT_BALANCE et AMT_CREDIT_LIMIT_ACTUALTARGET versus les features corrélées négativement
correlations_tail=df_sampled[["TARGET",
'EXT_SOURCE_3',
'EXT_SOURCE_2',
'EXT_SOURCE_1',
'DAYS_BIRTH',
'DAYS_EMPLOYED',
'DAYS_CREDIT',
'AMT_GOODS_PRICE',
'CNT_INSTALMENT_MATURE_CUM',
'AMT_CREDIT',
'REG_REGION_NOT_LIVE_REGION',
'AMT_ANNUITY_x',
'FLAG_PHONE',
'AMT_CREDIT_SUM',
'FLAG_CONT_MOBILE',
'AMT_CREDIT_SUM_LIMIT']]
matrix = correlations_tail.corr()
#f, ax = plt.subplots(figsize=(10, 12))
sns.clustermap(matrix,
square=True, # make cells square
cbar_kws={'fraction' : 0.03}, # shrink colour bar
cmap='PiYG', # use orange/red colour map
linewidth=1, # space between cells
cbar=True)
plt.show()
Remarques
EXT_SOURCE et DAYS_BIRTH.Par la suite nous allons comparer l'influence des variables les plus fortement corrélées avec la cible.
Correlations positives
REGION_RATING_CLIENT versus TARGET
Comme vu précédemment, il s'agit probablement de la même donnée : REGION_RATING_CLIENT et REGION_RATING_CLIENT_W_CITY. Leur relation avec la cible est égale : 0,061. Visualisons cette relation grâce au plot KDE :
plt.figure(figsize = (10, 8))
# KDE des emprunts remboursés dans les temps
sns.kdeplot(df_sampled.loc[df_sampled['TARGET'] == 0, 'REGION_RATING_CLIENT_W_CITY'], label = 'Client 0')
# KDE des emprunts qui n'ont pas été remboursés dans les temps
sns.kdeplot(df_sampled.loc[df_sampled['TARGET'] == 1, 'REGION_RATING_CLIENT_W_CITY'], label = 'Client 1')
plt.xlabel('Rating')
plt.ylabel('Density')
plt.title('Distribution des scores')
plt.legend()
plt.show()
Correlations négatives
DAYS_BIRTH versus TARGET : -0,070
Il semblerait que plus l'âge augmente, moins le demandeur serait susceptible de faire défaut.
L'histogramme des âges a montré une répartition assez homogène des âges. Esayons de visualiser la corrélation à l'aide du plot KDE.
df_sampled['DAYS_BIRTH'] = abs(df_sampled['DAYS_BIRTH'])
plt.figure(figsize = (10, 8))
# KDE plot of loans that were repaid on time
sns.kdeplot(df_sampled.loc[df_sampled['TARGET'] == 0, 'DAYS_BIRTH'] / 365, label = 'Client 0')
# KDE plot of loans which were not repaid on time
sns.kdeplot(df_sampled.loc[df_sampled['TARGET'] == 1, 'DAYS_BIRTH'] / 365, label = 'Client 1')
# Labeling of plot
plt.xlabel('Age (années)')
plt.ylabel('Densité')
plt.title('Distribution d\'âges')
plt.legend()
plt.show()
Nous observons une asymétrie au niveau de l'âge jeune :
1)0)
Visiblement la variable DAYS_BIRTH a une influence sur la cible.Correlations négatives
Les 3 variables avec les corrélations négatives les plus fortes sont :
Leur définition : un score normalisé à partir sources de données extérieures.
Visualisons leur impact sur la cible :
plt.figure(figsize = (10, 12))
# itération sur 3 variables
for i, source in enumerate(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']):
# création subplot pour chaque variable
plt.subplot(3, 1, i + 1)
# plot des emprunts remboursés
sns.kdeplot(df_sampled.loc[df_sampled['TARGET'] == 0, source], label = 'Client 0')
# plot des emprunts non remboursés
sns.kdeplot(df_sampled.loc[df_sampled['TARGET'] == 1, source], label = 'Client 1')
plt.title('Distribution of %s by Target Value' % source)
plt.xlabel('%s' % source); plt.ylabel('Density'); plt.legend()
plt.tight_layout(h_pad = 2.5)
Conclusions
Parmi les trois variables, c'est la EXT_SOURCE_3 qui semble avoir la corrélation la plus élevée avec la cible (nous observons la plus grande différence entre les valeurs de la cible).
La corrélation n'est pas forte, en réalité toutes les relations sont considérées comme très faibles, mais pour autant elle pourra sûrement être utile dans un modèle de machine learning pour prédire si oui ou non un emprunteur remboursera dans les délais.
Analysons à présent les corrélations entre les variables les plus corrélées avec la cible :
# Extraction des variables EXT_SOURCE et des 3 autres variables les plus fortement corrélées avec la cible, et visualisation des corrélations
ext_data = df_sampled[['TARGET', 'EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'DAYS_BIRTH', 'REGION_RATING_CLIENT_W_CITY']]
ext_data_corrs = ext_data.corr()
plt.figure(figsize=(16,6))
sns.heatmap(ext_data_corrs, annot=True, fmt='.1g', linewidth=2, vmin=-0.4, vmax=0.6, center= 0, cmap='magma')
plt.title('Matrice de corrélation', size=18)
plt.show()
Conclusions
EXT_SOURCE ont une corrélation négative avec la cible0DAYS_BIRTH est fortement corrélée, de façon positive, avec EXT_SOURCE_1EXT_SOURCE_2 et REGION_RATING_CLIENTAvant la modélisation
df_sampled.drop(df_sampled[['SK_ID_CURR']], axis=1, inplace=True)
Fichier csv pour prétraitement.
# Sauvegarde du fichier pour prétraitement :
joblib.dump(df_sampled, 'df_clean_eda')
['df_clean_eda']